Préparation du jeu de données

On loade la matrice de reads par échantillon et par gènes obtenue par htseq-count. On supprime les dernières lignes qui correspondent aux reads non identifiés comme des gènes ou comme ambigus.

WT.K2 WT.K3 WT.N1 WT.N2 WT.N3
AT1G01010.1 5 2 3 6 3
AT1G01020.1 0 0 0 1 0
AT1G01020.2 0 0 0 0 0
AT1G01030.1 4 2 0 0 0
AT1G01040.1 15 8 18 14 14
AT1G01040.2 0 0 0 0 0

On rensigne le design et le type d’échantillons.

condition libType
WT.K2 untreated single-end
WT.K3 untreated single-end
WT.N1 treated single-end
WT.N2 treated single-end
WT.N3 treated single-end

Normalisation

D’après les 2 papiers de review lus (et détails techniques sur https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html) :

  • RPKM non adaptés pour la DE : Permettent la comparaison entre gènes d’un même échantillon en corrigeant la profondeur de séquençage et la taille des gènes, mais ne permet pas de comparer des échantillons différents. (La somme des counts normalisés en RPKM ne sera pas la même pour deux échantillons). De plus, on compare des gènes identiques uniquement, donc la taille n’est pas importante.

  • DESeq : la médiane de ratios : On calcule un sizeFactor propre à chaque échantillon pour corriger la profondeur de séquençage. On divise chaque count par la moyenne géométrique des counts de son gène parmis tous les échantillons (Comme si on créait un nouvel échantillon de pseudo-référence. Les ratios dans un même échantillon devraient être similaires entre la plupart des gènes, qui sont en majorité non differentiellement exprimés). Le size factor d’un échantillon est la médiane de ces ratios pour tous les gènes. Pour normaliser, on divise tous les counts d’un échantillon par son sizeFactor. (On ne considère pas dans la médiane les gènes ayant une moyenne géométrique de 0)

  • TMM :

à voir avec EdgeR.

    WT.K2     WT.K3     WT.N1     WT.N2     WT.N3 
1.2619801 0.4435494 1.2221518 1.1584047 1.4073152 

Estimation de la variance

The variance seen between counts is the sum of two components: the sample-to-sample variation just mentioned, and the uncertaintyin measuring a concentration by counting reads. The latter, known as shot noise or Poisson noise, is the dominatingnoise source for lowly expressed genes. The former dominates for highly expressed genes. The sum of both, shot noiseand dispersion, is considered in the differential expression inference.Hence, the variancevof count values is modelled as \(v = s\mu + \alpha s^2\mu^2\),

where \(\mu\) is the expected normalized count value (estimated by the average normalized count value),\(s\) is the size factor for the sample under consideration, and \(\alpha\) is the dispersion value for the gene under consideration.

# Analyse d’expression différentielle

On continue maintenant avec DESeq2, qui utilise le même test statistique que DESeq mais avec des améliorations pour les estimations et le GLM. Test : maintenant que les variances ont été estimées, il devient facile de comparer les conditions avec un test, rbinom.

A decrire mieux et parler des modèles, tout ça tout ça.

On construit le dataset, puis on enlève les lignes dont la somme des reads ne dépasse pas 10.

[1] 88

On a peu de gènes DE, seulement 48 avec un seuil à 5% de p-value, et 60 à 10%.

Visu du gènes le plus différentiellement exprimé :

On essaie de voir ce qui pourrait être étrange dans les données pour avoir si peu de gènes :

[1] 1240

En prenant une condition sur le log2FoldChange on a beaucoup plus de gènes DE. Affichage en heatmap des gènes les plus différentiellement exprimés.

On s’intéresse maintenant à la liste de gènes sélectionnés et à leur ontologie.

Genes DE et leur GO
ensembl_transcript_id description external_gene_name
AT1G06000.1 AT1G06000.1 Glycosyltransferase (Fragment) [Source:UniProtKB/TrEMBL;Acc:W8PVE7] UGT89C1
AT1G07520.1 AT1G07520.1 GRAS family transcription factor [Source:TAIR;Acc:AT1G07520]
AT1G08090.1 AT1G08090.1 High-affinity nitrate transporter 2.1 [Source:UniProtKB/Swiss-Prot;Acc:O82811] NRT2.1
AT1G08592.1 AT1G08592.1 Potential natural antisense gene, locus overlaps with AT1G08590 [Source:TAIR;Acc:AT1G08592]
AT1G15550.1 AT1G15550.1 Gibberellin 3-beta-dioxygenase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q39103] GA3OX1
AT1G21310.1 AT1G21310.1 Extensin-3 [Source:UniProtKB/Swiss-Prot;Acc:Q9FS16] EXT3
AT1G24290.1 AT1G24290.1 AAA-type ATPase family protein [Source:UniProtKB/TrEMBL;Acc:O48696]
AT1G25550.1 AT1G25550.1 Transcription factor HHO3 [Source:UniProtKB/Swiss-Prot;Acc:Q9FPE8] HHO3
AT1G26250.1 AT1G26250.1 Proline-rich extensin-like family protein [Source:UniProtKB/TrEMBL;Acc:Q9C669]
AT1G29910.1 AT1G29910.1 Chlorophyll a-b binding protein 3, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q8VZ87] LHCB1.1
AT1G29930.1 AT1G29930.1 Chlorophyll a-b binding protein 1, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:P04778] LHCB1.3
AT1G35560.1 AT1G35560.1 Transcription factor TCP23 [Source:UniProtKB/Swiss-Prot;Acc:Q9LQF0] TCP23
AT1G37130.1 AT1G37130.1 Nitrate reductase [NADH] 2 [Source:UniProtKB/Swiss-Prot;Acc:P11035] NIA2
AT1G43710.1 AT1G43710.1 SDC1 [Source:UniProtKB/TrEMBL;Acc:A0A178WA97] SDC
AT1G50750.1 AT1G50750.1 Plant mobile domain protein family [Source:TAIR;Acc:AT1G50750]
AT1G55580.1 AT1G55580.1 SCL18 [Source:UniProtKB/TrEMBL;Acc:A0A178WEV4] SCL18
AT1G55920.1 AT1G55920.1 Serine acetyltransferase 1, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q42588] SAT1
AT1G55990.1 AT1G55990.1 Glycine-rich protein [Source:UniProtKB/TrEMBL;Acc:F4I3I2]
AT1G64185.1 AT1G64185.1 At1g64185 [Source:UniProtKB/TrEMBL;Acc:Q8LGF9]
AT1G64190.1 AT1G64190.1 6-phosphogluconate dehydrogenase, decarboxylating 1, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q9SH69] PGD1
AT1G64720.1 AT1G64720.1 Polyketide cyclase/dehydrase and lipid transport superfamily protein [Source:UniProtKB/TrEMBL;Acc:Q9XIR9] CP5
AT1G66180.1 AT1G66180.1 Eukaryotic aspartyl protease family protein [Source:UniProtKB/TrEMBL;Acc:Q9C8C9]
AT1G68670.1 AT1G68670.1 Transcription factor HHO2 [Source:UniProtKB/Swiss-Prot;Acc:Q8VZS3] HHO2
AT1G72450.1 AT1G72450.1 TIFY11B [Source:UniProtKB/TrEMBL;Acc:A0A178W2J3] TIFY11B
AT1G77760.1 AT1G77760.1 Nitrate reductase [Source:UniProtKB/TrEMBL;Acc:A0A178WBR8] NIA1
AT1G78050.1 AT1G78050.1 Phosphoglycerate/bisphosphoglycerate mutase [Source:UniProtKB/TrEMBL;Acc:F4I8M8] PGM
AT1G78060.1 AT1G78060.1 Probable beta-D-xylosidase 7 [Source:UniProtKB/Swiss-Prot;Acc:Q9SGZ5] BXL7
AT1G78090.1 AT1G78090.1 Trehalose-phosphate phosphatase B [Source:UniProtKB/Swiss-Prot;Acc:Q9C9S4] TPPB
AT1G80440.1 AT1G80440.1 F-box/kelch-repeat protein At1g80440 [Source:UniProtKB/Swiss-Prot;Acc:Q9M8L2]
AT2G15620.1 AT2G15620.1 Ferredoxin–nitrite reductase, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q39161] NIR1
AT2G17820.1 AT2G17820.1 Histidine kinase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9SXL4] AHK1
AT2G22500.1 AT2G22500.1 Mitochondrial uncoupling protein 5 [Source:UniProtKB/Swiss-Prot;Acc:Q9SJY5] PUMP5
AT2G24540.1 AT2G24540.1 F-box protein AFR [Source:UniProtKB/Swiss-Prot;Acc:Q8LAW2] AFR
AT2G27510.1 AT2G27510.1 Ferredoxin [Source:UniProtKB/TrEMBL;Acc:A0A178VWS0] FD3
AT2G30040.1 AT2G30040.1 Mitogen-activated protein kinase kinase kinase 14 [Source:UniProtKB/TrEMBL;Acc:O64741] MAPKKK14
AT2G30615.1 AT2G30615.1 F-box/LRR protein [Source:UniProtKB/TrEMBL;Acc:Q58G02]
AT2G34080.1 AT2G34080.1 Cysteine proteinases superfamily protein [Source:UniProtKB/TrEMBL;Acc:O22961]
AT2G34420.1 AT2G34420.1 Chlorophyll a-b binding protein, chloroplastic [Source:UniProtKB/TrEMBL;Acc:Q39141] LHB1B2
AT2G35637.1 AT2G35637.1 other RNA [Source:TAIR;Acc:AT2G35637]
AT2G35640.1 AT2G35640.1 Homeodomain-like superfamily protein [Source:UniProtKB/TrEMBL;Acc:Q9ZQN7]
AT2G36470.1 AT2G36470.1 DUF868 family protein, putative (DUF868) [Source:UniProtKB/TrEMBL;Acc:Q9SJQ8]
AT2G36570.1 AT2G36570.1 Leucine-rich repeat receptor-like protein kinase PXC1 [Source:UniProtKB/Swiss-Prot;Acc:Q9SJQ1] PXC1
AT2G36580.1 AT2G36580.1 Pyruvate kinase [Source:UniProtKB/TrEMBL;Acc:A0A178VX82]
AT2G36590.1 AT2G36590.1 Proline transporter 3 [Source:UniProtKB/Swiss-Prot;Acc:Q9SJP9] PROT3
AT2G38640.1 AT2G38640.1 Protein LURP-one-related 8 [Source:UniProtKB/Swiss-Prot;Acc:Q9ZVI6]
AT2G39210.1 AT2G39210.1 At2g39210/T16B24.15 [Source:UniProtKB/TrEMBL;Acc:O80960]
AT2G41440.1 AT2G41440.1 unknown protein; BEST Arabidopsis thaliana protein match is: unknown protein (TAIR:AT2G41470.1); Ha. [Source:TAIR;Acc:AT2G41440]
AT2G45685.1 AT2G45685.1 other RNA [Source:TAIR;Acc:AT2G45685]
AT3G02910.1 AT3G02910.1 Putative gamma-glutamylcyclotransferase At3g02910 [Source:UniProtKB/Swiss-Prot;Acc:Q9M8T3]
AT3G02920.2 AT3G02920.2 Replication protein A 32 kDa subunit B [Source:UniProtKB/Swiss-Prot;Acc:Q8LFJ8] RPA2B
AT3G07350.1 AT3G07350.1 F21O3.6 protein [Source:UniProtKB/TrEMBL;Acc:Q9SRT1]
AT3G16240.1 AT3G16240.1 Aquaporin TIP2-1 [Source:UniProtKB/Swiss-Prot;Acc:Q41951] TIP2-1
AT3G19430.1 AT3G19430.1 Late embryogenesis abundant protein-related / LEA protein-like protein [Source:UniProtKB/TrEMBL;Acc:A0A178VGJ3]
AT3G24520.1 AT3G24520.1 Heat stress transcription factor C-1 [Source:UniProtKB/Swiss-Prot;Acc:Q9LV52] HSFC1
AT3G27180.1 AT3G27180.1 At3g27180 [Source:UniProtKB/TrEMBL;Acc:B3DNP3]
AT3G28550.1 AT3G28550.1 Proline-rich extensin-like family protein [Source:UniProtKB/TrEMBL;Acc:F4J0B5]
AT3G47520.1 AT3G47520.1 Malate dehydrogenase, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q9SN86] MDH
AT3G47980.1 AT3G47980.1 At3g47980 [Source:UniProtKB/TrEMBL;Acc:Q84MC3]
AT3G48360.1 AT3G48360.1 BTB/POZ and TAZ domain-containing protein 2 [Source:UniProtKB/Swiss-Prot;Acc:Q94BN0] BT2
AT3G49940.1 AT3G49940.1 LOB domain-containing protein 38 [Source:UniProtKB/Swiss-Prot;Acc:Q9SN23] LBD38
AT3G61580.1 AT3G61580.1 Delta(8)-fatty-acid desaturase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9ZRP7] SLD1
AT4G13918.1 AT4G13918.1 Potential natural antisense gene, locus overlaps with AT4G13920 [Source:TAIR;Acc:AT4G13918]
AT4G25835.1 AT4G25835.1 AAA-ATPase At4g25835 [Source:UniProtKB/Swiss-Prot;Acc:Q8RY66]
AT4G29700.1 AT4G29700.1 Alkaline-phosphatase-like family protein [Source:UniProtKB/TrEMBL;Acc:Q9SU81]
AT4G31910.1 AT4G31910.1 Brassinosteroid-related acyltransferase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9SZ58] BAT1
AT4G32950.1 AT4G32950.1 Probable protein phosphatase 2C 61 [Source:UniProtKB/Swiss-Prot;Acc:O82637]
AT4G36850.1 AT4G36850.1 PQ-loop repeat family protein / transmembrane family protein [Source:UniProtKB/TrEMBL;Acc:Q94AH7]
AT4G37540.1 AT4G37540.1 LOB domain-containing protein 39 [Source:UniProtKB/Swiss-Prot;Acc:Q9SZE8] LBD39
AT4G37608.1 AT4G37608.1 Uncharacterized protein AT4g37600 [Source:UniProtKB/TrEMBL;Acc:Q9SZF3]
NA NA NA NA
AT4G38340.1 AT4G38340.1 Protein NLP3 [Source:UniProtKB/Swiss-Prot;Acc:Q9SVF1] NLP3
AT5G01215.1 AT5G01215.1 other RNA [Source:TAIR;Acc:AT5G01215]
AT5G01215.2 AT5G01215.2 other RNA [Source:TAIR;Acc:AT5G01215]
AT5G01542.1 AT5G01542.1 Potential natural antisense gene, locus overlaps with AT5G01540 [Source:TAIR;Acc:AT5G01542]
AT5G04840.1 AT5G04840.1 At5g04840 [Source:UniProtKB/TrEMBL;Acc:Q8L5X9]
AT5G09800.1 AT5G09800.1 U-box domain-containing protein 28 [Source:UniProtKB/Swiss-Prot;Acc:Q9LXE3] PUB28
AT5G13110.1 AT5G13110.1 Glucose-6-phosphate 1-dehydrogenase 2, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q9FY99] G6PD2
AT5G13420.1 AT5G13420.1 Aldolase-type TIM barrel family protein [Source:UniProtKB/TrEMBL;Acc:Q9LYR4]
AT5G14760.1 AT5G14760.1 L-aspartate oxidase, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q94AY1] AO
AT5G20540.1 AT5G20540.1 Protein Brevis radix-like 4 [Source:UniProtKB/Swiss-Prot;Acc:Q8GZ92] BRXL4
AT5G25460.1 AT5G25460.1 At5g25460 [Source:UniProtKB/TrEMBL;Acc:Q94F20]
AT5G47220.1 AT5G47220.1 ERF2 [Source:UniProtKB/TrEMBL;Acc:A0A178UCP3] ERF2
AT5G49480.1 AT5G49480.1 Calcium-binding protein CP1 [Source:UniProtKB/Swiss-Prot;Acc:Q9FDX6] ATCP1
AT5G51830.1 AT5G51830.1 Probable fructokinase-7 [Source:UniProtKB/Swiss-Prot;Acc:Q9FLH8]
AT5G54170.1 AT5G54170.1 Polyketide cyclase/dehydrase and lipid transport superfamily protein [Source:UniProtKB/TrEMBL;Acc:F4JYV3]
AT5G57770.1 AT5G57770.1 Plant protein of unknown function (DUF828) with plant pleckstrin homology-like region [Source:TAIR;Acc:AT5G57770]
AT5G63160.1 AT5G63160.1 BTB/POZ and TAZ domain-containing protein 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9FMK7] BT1
AT5G67420.1 AT5G67420.1 LOB domain-containing protein 37 [Source:UniProtKB/Swiss-Prot;Acc:Q9FN11] LBD37
 

A work by Océane Cassan

oceane.cassan@supagro.fr